feat(spam): prune confident-spam from published by themightychris · Pull Request #133 · CodeForPhilly/codeforphilly-ng

themightychris · 2026-06-26T04:09:39Z

Why

The full published import (~31.8k people, ~61% offline-flagged spam) no longer fits the in-memory heap budget on the 4 GB sandbox nodes — a cold boot OOM'd. Rather than double node cost to hold tens of thousands of spam accounts, prune confident-spam from published so the runtime loads only real members. Spam accounts also don't belong in the public, civic-transparency dataset.

Context: this came out of the boot-OOM incident (see #132). Pruning is the durable fix; the memory bump in #131 just bought headroom.

What

specs/behaviors/spam-exclusion.md: the contract, covering verdict aggregation, prune + cascade scope, idempotency, and pipeline ordering.
apps/api/scripts/prune-spam.ts: re-runnable operator script. Reads person-evaluations verdicts from spam-detection (via streaming git cat-file rather than gitsheets, since there are ~54k records), aggregates per person, and cascade-prunes confident-spam from published in one gitsheets transaction.
plans/spam-prune.md: the plan (links #132, "perf: investigate in-memory state heap footprint (~60x on-disk-to-heap expansion)").
Docs: spam-detection.md + cutover.md updated so the reimport process always runs the prune (with the resurrection-on-reimport ordering warning).
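
The verdict read is described above only as "streaming git cat-file". A minimal sketch of how such a stream could be consumed, assuming the `--batch` output format (a `<oid> <type> <size>` header line, then `<size>` bytes of content and a trailing newline). `CatFileBatchParser` and `GitObject` are hypothetical names and are not taken from prune-spam.ts:

```typescript
// Hypothetical incremental parser for `git cat-file --batch` output.
// Assumption: the real script may use a different mode or parser.
interface GitObject {
  oid: string;
  type: string;
  body: string;
}

class CatFileBatchParser {
  private buf: Uint8Array = new Uint8Array(0);
  private readonly decoder = new TextDecoder();

  // Feed one stdout chunk; returns the objects completed by this chunk.
  push(chunk: Uint8Array): GitObject[] {
    const merged = new Uint8Array(this.buf.length + chunk.length);
    merged.set(this.buf, 0);
    merged.set(chunk, this.buf.length);
    this.buf = merged;

    const out: GitObject[] = [];
    for (;;) {
      const nl = this.buf.indexOf(0x0a);
      if (nl < 0) break; // header line not complete yet
      const header = this.decoder.decode(this.buf.subarray(0, nl));
      const parts = header.split(" ");
      if (parts.length === 2 && parts[1] === "missing") {
        // "<name> missing" has no body; skip the line.
        this.buf = this.buf.subarray(nl + 1);
        continue;
      }
      const size = Number(parts[2]);
      if (parts.length !== 3 || !Number.isInteger(size) || size < 0) {
        throw new Error(`unexpected cat-file header: ${header}`);
      }
      const bodyEnd = nl + 1 + size;
      if (this.buf.length < bodyEnd + 1) break; // body + trailing LF not complete
      out.push({
        oid: parts[0],
        type: parts[1],
        body: this.decoder.decode(this.buf.subarray(nl + 1, bodyEnd)),
      });
      this.buf = this.buf.subarray(bodyEnd + 1);
    }
    return out;
  }
}
```

Sizes are counted in bytes, so the sketch buffers raw bytes and decodes to text only once a full record is available; it holds at most one partial record in memory at a time.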

No runtime/loader change — published simply ends up smaller.

The rule

Prune a person iff: ≥1 spam verdict at confidence ≥ 0.8, and no legit verdict at any confidence, and no project membership (real involvement overrides a spam verdict). Cascade deletes their memberships / help-wanted-interest / person tag-assignments; nulls authorId on their project-updates.
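
The rule above can be sketched as a pure function. This is an illustration under stated assumptions, not the code in prune-spam.ts: the `Verdict` shape, the `"spam"` / `"legit"` label strings, and the function name are hypothetical; only the 0.8 threshold and the three clauses come from the rule as written.

```typescript
// Field names and label values are assumptions for illustration only.
interface Verdict {
  personId: string;
  label: string; // e.g. "spam" or "legit"
  confidence: number; // 0..1
}

const SPAM_CONFIDENCE_THRESHOLD = 0.8;

// Returns the ids of people the rule would prune.
function selectPrunable(
  verdicts: Verdict[],
  projectMemberIds: ReadonlySet<string>,
): Set<string> {
  const confidentSpam = new Set<string>();
  const hasLegit = new Set<string>();

  for (const v of verdicts) {
    if (v.label === "spam" && v.confidence >= SPAM_CONFIDENCE_THRESHOLD) {
      confidentSpam.add(v.personId);
    }
    // A legit verdict at any confidence vetoes the prune.
    if (v.label === "legit") hasLegit.add(v.personId);
  }

  const prunable = new Set<string>();
  confidentSpam.forEach((id) => {
    // Project membership overrides a spam verdict.
    if (!hasLegit.has(id) && !projectMemberIds.has(id)) prunable.add(id);
  });
  return prunable;
}
```

Because the result depends only on the verdicts and current memberships, re-running it over an already-pruned dataset selects nobody who still exists, which is consistent with the idempotency requirement.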

Validation (on a throwaway clone)


Check	Result
People	31,832 → 18,203 (pruned 13,629)
Protected by project membership	1
Person tag-assignments removed	1,710
Memberships / updates touched	0 / 0
Idempotent re-run	✅ 0 changes
Boot (pruned) heap / RSS	459 MB / 658 MB at a 1536 MB cap (full data OOM'd >2.5 GB)
type-check / lint	✅

Spot-checked 10 pruned accounts: 9 unambiguous bulk-created commercial spam; the 1 with a real project membership is now protected by the membership clause.

Ordering (documented, mandatory)

published is the merge target of legacy-import (the full raw snapshot), so any re-import/merge re-adds pruned spam. The pipeline must therefore always run prune after the merge and before the push: import → merge → (re-)eval → prune → push.
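
The ordering constraint can be expressed as a small check over a sequence of pipeline steps. This is a hedged sketch: the `Step` type and `pruneGuardsEveryPush` are hypothetical names, and the function only checks the prune-after-merge invariant, not whether eval ran before prune.

```typescript
// Hypothetical model of the pipeline steps named in the ordering above.
type Step = "import" | "merge" | "eval" | "prune" | "push";

// Returns true when no push can publish a merge that has not been pruned since.
function pruneGuardsEveryPush(steps: Step[]): boolean {
  let mergedSincePrune = false;
  for (const step of steps) {
    if (step === "merge") {
      mergedSincePrune = true; // a merge may resurrect pruned spam
    } else if (step === "prune") {
      mergedSincePrune = false;
    } else if (step === "push" && mergedSincePrune) {
      return false;
    }
  }
  return true;
}
```

A check like this could run in an operator script or CI step before pushing published, so that a re-import followed directly by a push is rejected.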

🤖 Generated with Claude Code

The legacy import carried ~31.8k people, ~61% judged spam by the offline person-evaluations pass. Loading them all exceeded the in-memory heap budget on the standard node size, and spam accounts don't belong in the public civic-transparency dataset anyway.

Spec defines: verdict aggregation (prune iff confident spam, no legit, and no project membership), the cascade-prune on `published`, idempotency, and the import → merge → eval → prune → push pipeline ordering. Exclusion happens in the data pipeline, not the runtime loader.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Re-runnable script that reads person-evaluations verdicts from the spam-detection branch and removes confident-spam people from published with cascaded deletes of their memberships / help-wanted-interest / person tag-assignments, nulling authorId on their project-updates. Project members are protected (real involvement overrides a spam verdict).

Reads the ~54k evaluations via streaming git cat-file (not gitsheets) and applies the prune in one gitsheets transaction. Idempotent; --dry-run reports counts. No runtime/loader change: published simply ends up smaller.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
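
The cascade described in this commit message can be illustrated over plain in-memory arrays. This is a sketch under stated assumptions: the `Dataset` shape, every field name, and `cascadePrune` are hypothetical, and the arrays stand in for the gitsheets transaction, which is not modeled here.

```typescript
// Hypothetical record shapes; the real sheets may differ.
interface Dataset {
  people: { id: string }[];
  memberships: { personId: string; projectId: string }[];
  helpWantedInterest: { personId: string; roleId: string }[];
  tagAssignments: { taggableType: "person" | "project"; taggableId: string; tagId: string }[];
  projectUpdates: { id: string; authorId: string | null }[];
}

interface PruneCounts {
  people: number;
  memberships: number;
  helpWantedInterest: number;
  tagAssignments: number;
  updatesNulled: number;
}

// Removes the given people plus dependent records; with dryRun it only counts.
function cascadePrune(
  data: Dataset,
  spamIds: ReadonlySet<string>,
  dryRun = false,
): { data: Dataset; counts: PruneCounts } {
  const authoredBySpam = (u: { authorId: string | null }) =>
    u.authorId !== null && spamIds.has(u.authorId);

  const next: Dataset = {
    people: data.people.filter((p) => !spamIds.has(p.id)),
    memberships: data.memberships.filter((m) => !spamIds.has(m.personId)),
    helpWantedInterest: data.helpWantedInterest.filter((h) => !spamIds.has(h.personId)),
    // Only person-targeted tag assignments are in scope.
    tagAssignments: data.tagAssignments.filter(
      (t) => !(t.taggableType === "person" && spamIds.has(t.taggableId)),
    ),
    // Updates are kept; only the author reference is nulled.
    projectUpdates: data.projectUpdates.map((u) =>
      authoredBySpam(u) ? { ...u, authorId: null } : u,
    ),
  };

  const counts: PruneCounts = {
    people: data.people.length - next.people.length,
    memberships: data.memberships.length - next.memberships.length,
    helpWantedInterest: data.helpWantedInterest.length - next.helpWantedInterest.length,
    tagAssignments: data.tagAssignments.length - next.tagAssignments.length,
    updatesNulled: data.projectUpdates.filter(authoredBySpam).length,
  };

  return { data: dryRun ? data : next, counts };
}
```

Running the function a second time on its own output reports zero changes in every count, which mirrors the idempotent re-run the PR validates.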

spam-detection.md: replace the "spam-purge is a future plan / filter on the read path" placeholder with the built prune step (command, rule, cascade, idempotency) and add the mandatory import → merge → eval → prune → push ordering warning, since a re-import resurrects pruned spam until prune re-runs.

cutover.md: add the prune as a required step after the legacy-import merge in both the T-1 and T-0 sequences.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

themightychris and others added 5 commits June 26, 2026 00:08

chore(plans): open spam-prune

c6a71e9

Implements specs/behaviors/spam-exclusion.md: prune confident-spam people from published so the runtime loads only real members and fits the node memory budget without a resize.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(plans): mark spam-prune done (PR #133)

b966800

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

themightychris mentioned this pull request Jun 26, 2026

fix(api): load sheets sequentially at boot to avoid OOM spike #134

Merged

themightychris merged commit b966800 into main Jun 26, 2026
1 check passed

themightychris deleted the feat/spam-prune branch June 26, 2026 14:23
